Accuracy of simple, initials-based methods for author name disambiguation

Author

  • Stasa Milojevic
Abstract

There are a number of solutions that perform unsupervised name disambiguation based on the similarity of bibliographic records or common co-authorship patterns. Whether the use of these advanced methods, which are often difficult to implement, is warranted depends on whether the accuracy of the most basic disambiguation methods, which use only the author's last name and initials, is sufficient for a particular purpose. We derive realistic estimates for the accuracy of simple, initials-based methods using simulated bibliographic datasets in which the true identities of authors are known. Based on simulations in five diverse disciplines, we find that the first initial method already correctly identifies 97% of authors. An alternative simple method, which takes all initials into account, is typically two times less accurate, except in certain datasets that can be identified by applying a simple criterion. Finally, we introduce a new name-based method that combines the features of the first initial and all initials methods by implicitly taking into account the last name frequency and the size of the dataset. This hybrid method reduces the fraction of incorrectly identified authors by 10-30% over the first initial method.

For a significant fraction of studies that are based on bibliometric data, as well as for purposes of research evaluation, it is essential to be able to attribute specific bibliographic records to individual researchers. A practical problem with this seemingly straightforward step is that the process carries a certain level of ambiguity, known as the author name disambiguation problem. The problem manifests itself in two ways: a given individual may be identified as two or more authors (splitting), or two or more individuals may be identified as a single author (merging). Here we use the term individual to refer to an actual person, and author to refer to an entity that results from the author disambiguation procedure.
Both the splitting and the merging can happen as the result of the same disambiguation method. The problem of author disambiguation fundamentally arises because personal names are not sufficiently distinct considering the large number of researchers active in most disciplines today. It is further exacerbated by the inconsistent way in which author names are reported in publications. The
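
As a rough illustration of the two simple methods named in the abstract, the following Python sketch builds a first-initial key and an all-initials key for each record and groups records by that key. The function names, the key format, and the three toy records are assumptions made for this example and are not taken from the paper. The hybrid method is not sketched because the text above does not specify it.

```python
from collections import defaultdict


def _initials(given_names):
    """Return lowercase initials from a given-names string such as 'John A.' or 'J.A.'."""
    return [part[0].lower() for part in given_names.replace(".", " ").split()]


def first_initial_key(last_name, given_names):
    """Key made of the last name plus the first initial only."""
    initials = _initials(given_names)
    return f"{last_name.lower()} {initials[0]}" if initials else last_name.lower()


def all_initials_key(last_name, given_names):
    """Key made of the last name plus every initial that is reported."""
    return f"{last_name.lower()} {''.join(_initials(given_names))}".strip()


def group_records(records, key_fn):
    """Map each name key to the set of true individuals that share it."""
    groups = defaultdict(set)
    for person_id, last_name, given_names in records:
        groups[key_fn(last_name, given_names)].add(person_id)
    return dict(groups)


# Hypothetical toy data: (true individual id, last name, given names as printed).
records = [
    (1, "Smith", "John A."),
    (1, "Smith", "John"),      # same person, middle initial omitted
    (2, "Smith", "Jane B."),
]

by_first = group_records(records, first_initial_key)
by_all = group_records(records, all_initials_key)

# Merging: a key that covers more than one true individual.
merged = {key for key, people in by_first.items() if len(people) > 1}

# Splitting: a true individual that appears under more than one key.
keys_per_person = defaultdict(set)
for key, people in by_all.items():
    for person in people:
        keys_per_person[person].add(key)
split = {person for person, keys in keys_per_person.items() if len(keys) > 1}
```

In this toy data the first-initial key merges the two individuals under one author, while the all-initials key keeps them apart but splits individual 1 into two authors because the middle initial is reported inconsistently.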

Similar resources

Author name disambiguation: What difference does it make in author-based citation analysis?

In this paper, we explore how strongly author name disambiguation (AND) affects the results of an author-based citation analysis study, and identify conditions under which the commonly used simplified approach of using surnames and first initials may suffice in practice. We compare author citation ranking and co-citation mapping results in the stem cell research field 2004-2009 between two AND ...

Improving the Accuracy of Author Name Disambiguation Using Ensemble Clustering

Today, digital libraries are important academic resources that include millions of citations and essential bibliographic information such as titles, author names and publication venues. From the viewpoint of knowledge accumulation management, the ability to search quickly and accurately for desired content is of great importance. The complexity and similarity in these resources cause many challenges and...

Distortive Effects of Initial-Based Name Disambiguation on Measurements of Large-Scale Coauthorship Networks

Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorsh...

Author Name Disambiguation Using a New Categorical Distribution Similarity

Author name ambiguity has been a long-standing problem which impairs the accuracy of publication retrieval and bibliometric methods. Most of the existing disambiguation methods are built on similarity measures, e.g., “Jaccard Coefficient”, between two sets of papers to be disambiguated, each set represented by a set of categorical features, e.g., coauthors and published venues. Such measures pe...

Author Name Disambiguation by Using Deep Neural Network

Author name ambiguity is one of the problems that decrease the quality and reliability of information retrieved from digital libraries. Existing methods have tried to solve this problem by predefining a feature set based on expert’s knowledge for a specific dataset. In this paper, we propose a new approach which uses deep neural network to learn features automatically for solving author name am...

Journal title:
  • J. Informetrics

Volume 7  Issue 

Pages  -

Publication date 2013